Contents

  • The Challenge of Unstructured Modeling
  • Using Graphs to Describe Model Structure
    • Directed Models
    • Undirected Models
    • The Partition Function
    • Energy-Based Models
    • Separation and D-Separation
    • Converting between Undirected and Directed Graphs
    • Factor Graphs
  • Sampling from Graphical Models (generating samples from the estimated distribution)
  • Advantages of Structured Modeling
  • Learning about Dependencies
  • Inference and Approximate Inference
  • The Deep Learning Approach to Structured Probabilistic Models (IMPORTANT!!!)
    • Example: Restricted Boltzmann Machine

Introduction

  • A structured probabilistic model is a way of describing a probability distribution, using a graph to describe which random variables in the distribution interact with each other directly.
    • Graph: vertices, edges
  • Deep learning practitioners tend to use very different model structures, learning algorithms, and inference procedures.

16.1 The Challenge of Unstructured Modeling

  • Tasks using Probabilistic Models are often more expensive than classification.
    • Multiple Output Values
    • c.f. Classification produces a single output, and ignores many parts of the input
  • Probabilistic Models => a complete understanding of the entire structure of the input
    • Density estimation
      • The base task: return $p(\mathbf{x})$ for an input $\mathbf{x}$
    • Denoising
      • Recover the clean $\mathbf{x}$ from a damaged or noisy observation
    • Missing value imputation
      • Given some observed entries of $\mathbf{x}$, predict (or give a distribution over) the unobserved entries
    • Sampling
      • Generates new samples from the distribution
  • Requirements for the models
    • Memory for representation
      • c.f. A naive (unstructured) distribution over n variables with k states each needs $O(k^n)$ table entries (see the sketch after this list)
    • Statistical efficiency
      • As the number of model parameters grows, the amount of training data needed to estimate them reliably grows too
    • The cost of inference
    • The cost of sampling
  • Structured probabilistic models provide a formal framework for modeling only direct interactions between random variables.
    • Fewer parameters
    • Reduce computational cost
      • storing the model, inference, sampling
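A quick back-of-the-envelope comparison of table sizes, as a minimal sketch; the variable count, state count, and maximum number of parents below are hypothetical:

```python
# Entries in a full joint table over n discrete variables with k states each,
# versus a directed model in which each variable has at most m parents.
def full_table_entries(n, k):
    return k ** n              # O(k^n): one entry per joint configuration

def structured_entries(n, k, m):
    return n * k ** (m + 1)    # each conditional table has at most k^(m+1) entries

n, k, m = 32, 2, 3             # hypothetical: 32 binary variables, <= 3 parents each
print(full_table_entries(n, k))      # 4294967296
print(structured_entries(n, k, m))   # 512
```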

16.2 Using Graphs to Describe Model Structure

16.2.1 Directed Models

  • Known as Belief Network, Bayesian Network
  • Directed Edge means conditional distribution
    • e.g. relay run $p(t_0, t_1, t_2) = p(t_0)p(t_1|t_0)p(t_2|t_1)$
  • General Case

    • $p(\mathbf{x}) = \prod_i p(x_i| Pa_\mathcal{G}(x_i))$
      • $Pa_\mathcal{G}(x_i)$ denotes the 'pa'rent nodes of $x_i$ in $\mathcal{G}$.
  • n discrete variables each having k values

    • Unstructured models: $O(k^n)$ parameters
    • Structured models: $O(k^m)$ parameters (m is the maximum number of variables appearing in a single conditional distribution)
      • Few parents in the graph -> few parameters -> efficient (see the sketch after this list)
  • Graph encoding
    • "Which variables are conditionally independent of each other."

16.2.2 Undirected Models

  • Known as Markov Random Fields (MRFs), Markov Networks
  • Use when the interactions seem to have no intrinsic direction or to operate in both directions.

    • e.g. Health of your roommate, you, your coworker
  • General Case

    • Unnormalized probability distribution for undirected models (a small example follows this list)
      • $\tilde{p}(\mathbf{x})=\prod_{C\in \mathcal{G}} \phi(C)$
        • C is a clique: a subset of nodes that are all connected to each other
        • $\phi(C)$ is the clique potential (a non-negative factor over the variables in C)
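A minimal sketch of the unnormalized distribution for a binary chain a - b - c with two pairwise cliques; the potential tables are made up:

```python
import numpy as np

# Hypothetical clique potentials for binary variables a - b - c (cliques {a,b}, {b,c}).
phi_ab = np.array([[2.0, 1.0],
                   [1.0, 3.0]])    # phi(a, b): larger when a == b
phi_bc = np.array([[4.0, 1.0],
                   [1.0, 4.0]])    # phi(b, c): larger when b == c

def p_tilde(a, b, c):
    """Unnormalized probability: product of the clique potentials."""
    return phi_ab[a, b] * phi_bc[b, c]

print(p_tilde(1, 1, 1))   # 3.0 * 4.0 = 12.0
print(p_tilde(0, 1, 0))   # 1.0 * 1.0 = 1.0
```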

16.2.3 The Partition Function

  • Normalized probability distribution for undirected models
    • $p(\mathbf{x}) = \frac{1}{Z}\tilde{p}(\mathbf x)$
      • $Z = \int \tilde{p}(\mathbf x) d\mathbf{x}$ (a sum for discrete variables; a brute-force sketch follows this list)
  • $Z$ is known as the partition function, a term borrowed from statistical physics.

    • often intractable to compute.
    • $\phi$ must be conducive to computing $Z$ efficiently.
    • We need Approximations (CH 18)
  • Difference btw Directed and Undirected Models

    Directed models are defined directly in terms of probability distributions from the start, while undirected models are defined more loosely by $\phi$ functions that are then converted into probability distributions.
  • The domain of each of the variables has a dramatic effect on the kind of probability distribution that a given set of $\phi$ functions corresponds to.
  • Often, it is possible to leverage the effect of a carefully chosen domain of a variable in order to obtain complicated behavior from a relatively simple set of $\phi$ functions.
    • e.g. whether $Z$ diverges can depend on whether $\mathbf{x} \in \mathbb{R}^n$ or $\mathbf{x} \in \{ 0,1 \}^n$
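For a small discrete model, $Z$ can be computed by brute force, as in the sketch below (reusing the chain potentials from 16.2.2); for continuous or high-dimensional $\mathbf{x}$ this sum or integral is generally intractable:

```python
import itertools
import numpy as np

phi_ab = np.array([[2.0, 1.0], [1.0, 3.0]])
phi_bc = np.array([[4.0, 1.0], [1.0, 4.0]])

def p_tilde(a, b, c):
    return phi_ab[a, b] * phi_bc[b, c]

# Z sums p_tilde over all 2^3 configurations; the normalized p then sums to 1.
Z = sum(p_tilde(*x) for x in itertools.product([0, 1], repeat=3))
p = lambda a, b, c: p_tilde(a, b, c) / Z

print(Z)                                                        # 35.0
print(sum(p(*x) for x in itertools.product([0, 1], repeat=3)))  # 1.0
```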

16.2.4 Energy-Based Models (EBM)

  • $\tilde{p}(\mathbf{x}) = \exp(-E(\mathbf{x}))$
    • Ensures $\tilde{p}(\mathbf{x}) > 0$ for every $\mathbf{x}$
    • $E(\mathbf{x})$ is the energy function
    • $\tilde{p}(\mathbf{x})$ in an EBM is an example of a Boltzmann distribution.
      • Many energy-based models are called Boltzmann machines.
  • Different cliques in the undirected graph correspond to the different terms of the energy function.
    • An EBM is just a special kind of Markov network
    • One can view an EBM as a product of "experts" (each term in the energy acts as one expert)
  • Terms such as "partition function" and "energy" were originally developed by statistical physicists.
  • Many algorithms never need $p_\text{model}(\mathbf{x})$ itself, only $\log \tilde{p}_\text{model}(\mathbf{x})$
    • Free energy, for models with latent variables $\mathbf{h}$ (a small sketch follows this list)
      • $\mathcal{F}(\mathbf{x}) = -\log \sum_\mathbf{h} \exp(-E(\mathbf{x}, \mathbf{h}))$
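A minimal sketch of the energy-based view: pick an energy function, obtain $\tilde{p} = \exp(-E)$, and compute the free energy by enumerating a small binary $\mathbf{h}$. The bilinear energy used here is made up purely for illustration.

```python
import itertools
import numpy as np

W = np.array([[1.0, -2.0],
              [0.5,  1.5]])              # hypothetical interaction weights

def energy(x, h):
    """Lower energy = higher unnormalized probability."""
    return -x @ W @ h                    # one possible bilinear energy term

def p_tilde(x, h):
    return np.exp(-energy(x, h))         # exp(-E) is positive for every (x, h)

def free_energy(x, n_h=2):
    """F(x) = -log sum_h exp(-E(x, h)), summing over all binary h."""
    s = sum(p_tilde(x, np.array(h)) for h in itertools.product([0, 1], repeat=n_h))
    return -np.log(s)

x = np.array([1.0, 0.0])
print(free_energy(x))                    # marginalizes over 4 hidden configurations
```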

16.2.5 Separation and D-Separation

  • Which variables indirectly interact?
  • Separation (for Undirected Models)
    • Conditional independences implied by the graph structure
    • Two sets of variables are separated given the observed variables if (1) no path exists between them, or (2) every path between them contains an observed variable
    • Active/Inactive paths
      • A path involving only unobserved variables is active => the variables it connects are not separated (may be dependent)
      • A path containing an observed variable is inactive => if all paths between two variables are inactive, they are separated (conditionally independent)
  • d-Separation (for Directed Models)

    • "d-" stands for "dependence"
    • Same active/inactive idea, but the rules for when a path is active are more involved in directed graphs; e.g. a path through a common child (collider) is inactive unless that child or one of its descendants is observed
  • Context-specific Independences

    • Cannot be represented with existing graphical notation
    • Independences that hold only for certain values of some variables in the network
    • e.g. for binary a, b, c (a small numeric check follows this list):
      1. a = 0 => b and c are independent
      2. a = 1 => b = c (deterministically)
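A small numeric check of the example above, with hypothetical conditional tables for binary a, b, c: when a = 0, c ignores b entirely; when a = 1, c is forced to equal b.

```python
import numpy as np

p_b = np.array([0.4, 0.6])               # p(b), made up

def p_c_given_ab(c, a, b):
    """p(c | a, b): independent of b when a = 0, deterministic copy of b when a = 1."""
    if a == 0:
        return [0.7, 0.3][c]
    return 1.0 if c == b else 0.0

# Given a = 0: p(b, c | a=0) factorizes as p(b) * p(c), so b and c are independent.
for b in (0, 1):
    for c in (0, 1):
        assert np.isclose(p_b[b] * p_c_given_ab(c, 0, b), p_b[b] * [0.7, 0.3][c])

# Given a = 1: c is a deterministic function of b, so b and c are fully dependent.
print(p_c_given_ab(1, 1, 1), p_c_given_ab(1, 1, 0))   # 1.0 0.0
```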

16.2.6 Converting between Undirected and Directed Graphs

  • We often refer to a specific machine learning model as being undirected or directed.
    • RBM: undirected
    • Sparse coding: Directed
  • But every probability distribution can be represented by either a directed or an undirected graph
    • No probabilistic model is inherently directed or undirected.
    • Some models are most easily described using a directed (or undirected) graph
  • We should "Choose" which language to use for each task.
    • Which approach can capture the most independences.
    • Which approach uses the fewest edges
  • We may sometimes switch between different modeling languages for a single probability distribution
    • Sampling: Directed (Ch16.3)
    • Approximate Inference: Undirected (Ch19)
  • Conversion between Directed and Undirected
    • Converting Directed into Undirected (a moralization sketch follows this list)
      • An undirected model cannot represent the independences implied by an "immorality" in a directed model
        • Immorality
          • Two parents that are not connected to each other share a common child in the directed graph
        • Moralized graph
          • The undirected graph that keeps as many of the directed graph's independences as possible
          • Add an undirected edge between every pair of "immoral" parents, then drop all edge directions
    • Converting Undirected into Directed
      • A directed model cannot represent the independences of an undirected model that contains a loop of length greater than three, unless that loop has a chord
      • Chord
        • An edge between any two non-consecutive variables in the loop
      • Convert to a chordal (triangulated) graph
        • Add chords until every loop of length four or more has one
        • i.e. break long loops into triangular loops
        • Then assign directions to the edges (e.g. following an ordering of the nodes) so that no directed cycle is created
      • WHY?
        • Compare how "separation" in undirected graphs and "d-separation" in directed graphs encode independences: each language can express independence structures the other cannot, so some graphs cannot be converted without losing independences
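A minimal sketch of the directed-to-undirected step (moralization), representing the directed graph as a plain parents dictionary: every pair of unconnected parents that share a child is "married", then edge directions are dropped.

```python
from itertools import combinations

def moralize(parents):
    """Convert a directed graph {node: set of parents} into an undirected edge set."""
    edges = set()
    for child, pa in parents.items():
        for p in pa:                          # keep each original edge, undirected
            edges.add(frozenset((p, child)))
        for p1, p2 in combinations(pa, 2):    # "marry" all parents of the same child
            edges.add(frozenset((p1, p2)))
    return edges

# Immorality: a -> c <- b with a and b not connected in the directed graph.
print(moralize({"a": set(), "b": set(), "c": {"a", "b"}}))
# -> edges {a,c}, {b,c}, plus the added marrying edge {a,b}
```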

16.2.7 Factor Graphs

  • Another way of drawing undirected models
    • Resolves an ambiguity in the standard graphical representation: a clique alone does not say whether it is described by one factor over all of its variables or by several smaller factors
    • Factor graphs explicitly represent the scope of each $\phi$ (a small illustration follows)
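A tiny illustration of the ambiguity factor graphs remove: the same undirected triangle over a, b, c could stand for one factor over all three variables or for three pairwise factors. Below, each factorization is simply written out as a list of factor scopes.

```python
# Two different factor graphs that share the same undirected graph (a 3-clique).
one_big_factor   = [("phi_abc", ("a", "b", "c"))]            # a single 3-way factor
pairwise_factors = [("phi_ab", ("a", "b")),                  # three 2-way factors
                    ("phi_bc", ("b", "c")),
                    ("phi_ac", ("a", "c"))]

def table_entries(factors, k=2):
    """Potential-table entries needed when every variable has k states."""
    return sum(k ** len(scope) for _, scope in factors)

print(table_entries(one_big_factor))    # 2^3 = 8
print(table_entries(pairwise_factors))  # 3 * 2^2 = 12, but only pairwise interactions
```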

16.3 Sampling from Graphical Models

  • Ancestral Sampling
    • Use "ONLY" Directed Graphical Models
    • Sort variables $\mathbf{x}_i$ in the graph into a topological ordering.
    • Sample in this order
    • e.g. If $\mathbf{x}_1$ -> $\mathbf{x}_2$ -> $\mathbf{x}_3$ ... -> $\mathbf{x}_n$
      • Sample $\mathbf{x}_1 \sim P(\mathbf{x}_1)$
      • Sample $\mathbf{x}_2 \sim P(\mathbf{x}_2 | \mathbf{x}_1)$
      • ...
      • Sample $\mathbf{x}_n \sim P(\mathbf{x}_n | \mathbf{x}_{n-1})$
  • Ancestral sampling for undirected models
    • Preprocessing: convert the undirected model into a directed one
    • The conversion itself may require solving intractable inference problems
    • Or the converted model may have so many edges that it becomes intractable
  • Gibbs sampling for undirected models (Ch17; a sketch follows this list)
    • Uses the separation properties of the graph
    • Iteratively visit each variable $\mathbf{x}_i$
    • Resample it conditioned only on its neighbors (a "local" update)
    • A single pass is not a fair sample $\mathbf{x} \sim P(\mathbf{x})$
    • Repeat!!! -> asymptotically converges to sampling from the correct distribution
    • But when do we stop?
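A minimal sketch of ancestral sampling for the chain example above (the conditional tables are made up; compare with the relay-run sketch in 16.2.1):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical CPTs for the chain x1 -> x2 -> x3 over 3-state variables.
p_x1 = np.array([0.3, 0.5, 0.2])
p_x2_given_x1 = np.array([[0.7, 0.2, 0.1], [0.2, 0.6, 0.2], [0.1, 0.3, 0.6]])
p_x3_given_x2 = np.array([[0.8, 0.1, 0.1], [0.3, 0.5, 0.2], [0.1, 0.2, 0.7]])

def ancestral_sample():
    """Visit variables in topological order, sampling each given its parents."""
    x1 = rng.choice(3, p=p_x1)
    x2 = rng.choice(3, p=p_x2_given_x1[x1])
    x3 = rng.choice(3, p=p_x3_given_x2[x2])
    return x1, x2, x3

print(ancestral_sample())
```

And a minimal Gibbs sampling sketch for the undirected binary chain a - b - c from 16.2.2, resampling one variable at a time from its conditional given the rest (only the neighbors actually matter):

```python
import numpy as np

rng = np.random.default_rng(0)
phi_ab = np.array([[2.0, 1.0], [1.0, 3.0]])
phi_bc = np.array([[4.0, 1.0], [1.0, 4.0]])

def p_tilde(a, b, c):
    return phi_ab[a, b] * phi_bc[b, c]

def gibbs_sweep(state):
    """One sweep: resample each variable from p(x_i | all other variables)."""
    s = list(state)
    for i in range(3):
        weights = []
        for v in (0, 1):
            s[i] = v
            weights.append(p_tilde(*s))
        s[i] = rng.choice(2, p=np.array(weights) / sum(weights))
    return tuple(s)

# Burn in, then collect samples; their empirical distribution approaches p(x),
# but knowing how many sweeps are "enough" is the hard part.
state = (0, 0, 0)
for _ in range(100):
    state = gibbs_sweep(state)
samples = []
for _ in range(5000):
    state = gibbs_sweep(state)
    samples.append(state)
```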

16.4 Advantages of Structured Modeling

  • Reduce the cost of representing probability distributions as well as learning and inference
  • Sampling
  • Easier to develop and debug
    • Separates the representation of knowledge from the learning of knowledge and from inference given existing knowledge

16.5 Learning about Dependencies

  • Use Latent(or Hidden) Variables $\mathbf{h}$
  • Visible variables $\mathbf{v}$ connected directly to each other VS $\mathbf{v}$ connected only indirectly through latent variables $\mathbf{h}$
    • Many edges VS few edges
    • Large cliques VS small cliques
    • High computational cost VS low computational cost
  • "Structure Learning"
    • Greedy search (a toy score-based sketch follows this list)
      • Propose a structure (with a small number of edges added or removed)
      • Train it, give it a score
      • Reward accuracy and penalize model complexity
      • Repeat
    • Using $\mathbf{h}$ avoids discrete search and multiple rounds of training (effective)
      • A fixed structure over $\mathbf{v}$ and $\mathbf{h}$ is enough!!!
  • $\mathbf{h}$ also provides an alternative representation of $\mathbf{v}$
    • Richer descriptions of the input
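A toy, score-based sketch of the greedy search idea (not the book's algorithm): fix the variable ordering x0 < x1 < x2, and greedily add a parent edge whenever it improves a BIC-style score (maximum-likelihood log-likelihood minus a complexity penalty) on some synthetic binary data.

```python
from itertools import product
import numpy as np

rng = np.random.default_rng(0)

# Synthetic binary data: x1 mostly copies x0, x2 is independent noise.
n = 2000
x0 = rng.integers(0, 2, n)
x1 = (x0 ^ (rng.random(n) < 0.1)).astype(int)
x2 = rng.integers(0, 2, n)
data = np.stack([x0, x1, x2], axis=1)

def score(child, parents):
    """Log-likelihood of the ML estimate of p(child | parents) minus a BIC penalty."""
    ll, n_params = 0.0, 0
    cols = data[:, parents] if parents else np.zeros((n, 0), dtype=int)
    for cfg in product([0, 1], repeat=len(parents)):
        mask = np.all(cols == cfg, axis=1)
        counts = np.bincount(data[mask, child], minlength=2)
        n_params += 1
        if counts.sum() == 0:
            continue
        probs = counts / counts.sum()
        ll += sum(c * np.log(q) for c, q in zip(counts, probs) if c > 0)
    return ll - 0.5 * np.log(n) * n_params

# Greedy search: repeatedly add the edge (earlier var -> later var) that helps the score.
parents = {0: [], 1: [], 2: []}
improved = True
while improved:
    improved = False
    for child in (1, 2):
        for cand in range(child):
            if cand in parents[child]:
                continue
            if score(child, sorted(parents[child] + [cand])) > score(child, parents[child]):
                parents[child] = sorted(parents[child] + [cand])
                improved = True
print(parents)   # typically recovers the edge x0 -> x1 and leaves x2 parentless
```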

16.6 Inference and Approximate Inference

  • Inference Problem
    • Predict the value of some variables given other variables
    • Predict the probability distribution over some variables given the value of other variables
  • Unfortunately, inference problems are...
    • Intractable in general
      • Using a structured graphical model is not, by itself, enough to allow efficient exact inference
    • Computing the marginal probability of a general graphical model is NP-hard (very hard to solve)
  • Use an approximate distribution $q(\mathbf{h}|\mathbf{v}) \approx p(\mathbf{h}|\mathbf{v})$ (Ch19)

16.7 The Deep Learning Approach to Structured Probabilistic Models

  • Deep Graphical Models VS Traditional Graphical Models
  • Different design decisions for Deep Learning
    • Different algorithms
    • Different models
  • The depth of a model in Deep Graphical Models
    • The depth of a latent variable $\mathbf{h}$ is...
      • The length of the shortest path from $\mathbf{h}$ to an observed variable
  • Deep Learning models...
    • Use Distributed Representations
    • Typically have more latent variables than observed variables.
    • Focus on indirect effects
      • Capture nonlinear interactions via indirect connections that pass through multiple latent variables
  • Traditional Graphical Models

    • Focus on direct effects
      • Capture nonlinear interactions using higher-order terms and structure learning between variables
      • Use only a few latent variables
  • Design of Latent variables in Deep Graphical Models

    • The latent variables do not have any specific semantics assigned in advance
    • Usually not very easy for a human to interpret
    • c.f. In the traditional models...
      • Latent variables are designed with some specific semantics in mind by a human
      • Less able to scale to complex problems
      • Not reusable
  • Connectivity typically used in Deep Graphical Models

    • Have large groups of units that are all connected to other groups of units.
      • Interactions between two groups may be described by a single matrix.
      • c.f. In the traditional models...
        • Very few connections, and the choice of connections for each variable may be individually designed
  • The training algorithm is "free" to invent the concepts it needs to model a particular dataset.

    • c.f. In the traditional models...
      • The choice of inference algorithm is "tightly linked" with the design of the model structure.
  • Traditional approaches typically aim at "exact inference".

    • Or use "loopy belief propagation" as an approximate inference algorithm (Murphy Ch20)
    • c.f. In deep graphical models...
      • Gibbs sampling or variational inference algorithms
  • Both of these tools (exact inference and loopy belief propagation) work well only with very sparsely connected graphs.

    • When the graphs are not sparse enough
      • exact inference and loopy belief propagation are no longer the relevant tools
  • Deep graphical models from a computational point of view

    • The very large number of latent variables makes efficient numerical code essential
      • => interactions between groups of units are implemented with efficient matrix product operations
      • => or with sparsely connected generalizations of them
        • block-diagonal matrix products or convolutions
      • c.f. Traditional models typically have irregular, hand-designed connectivity that does not map onto one big matrix operation
  • Trend: the deep learning approach is often to...

    • Figure out the minimum amount of information we absolutely need
    • Figure out how to get a reasonable approximation of that information as quickly as possible
    • c.f. The traditional approach is often to...
      • Simplify the model until the required quantities can be computed exactly
    • Increase the power of the model until it is barely possible to train or use
      • We still train the model even when...
        • Marginal distributions cannot be computed
          • as long as we can draw approximate samples
        • The objective function is intractable
          • as long as we can estimate its gradient

16.7.1 Example: The Restricted Boltzmann Machine

  • Units are divided into 2 groups: visible $\mathbf{v}$ and hidden $\mathbf{h}$
    • No direct interactions between any pair of visible units or any pair of hidden units ("restricted")
  • Commonly used as a building block in deep learning
  • Energy function (introduced in Ch20.2) $$\tilde{p}(\mathbf{v},\mathbf{h}) = \exp(-E(\mathbf{v},\mathbf{h}))$$ $$p(\mathbf{v},\mathbf{h}) = \frac{1}{Z} \exp(-E(\mathbf{v},\mathbf{h}))$$ $$E(\mathbf{v},\mathbf{h})= -\mathbf{b}^{T} \mathbf{v}-\mathbf{c}^{T} \mathbf{h} - \mathbf{v}^{T}\mathbf{W}\mathbf{h}$$
    • Properties (a NumPy sketch follows) $$p(\mathbf{h}|\mathbf{v})=\prod_i p(\mathrm{h}_i| \mathbf{v})$$ $$p(\mathbf{v}|\mathbf{h})=\prod_i p(\mathrm{v}_i| \mathbf{h})$$ $$P(\mathrm{h}_i = 1| \mathbf{v}) = \sigma(\mathbf{v}^{T} \mathbf{W}_{:,i} + \mathrm{c}_{i})$$ $$P(\mathrm{h}_i = 0| \mathbf{v})= 1 - \sigma(\mathbf{v}^{T} \mathbf{W}_{:,i} + \mathrm{c}_{i})$$ $$\frac{\partial}{\partial \mathbf{W}_{i,j} }E(\mathbf{v}, \mathbf{h}) = -\mathrm{v}_i \mathrm{h}_j$$
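A minimal NumPy sketch of the RBM energy and the block Gibbs sampling implied by the factorial conditionals above (random, untrained weights; the layer sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

n_v, n_h = 6, 4
W = rng.normal(0.0, 0.1, size=(n_v, n_h))   # visible-hidden interaction weights
b = np.zeros(n_v)                           # visible biases
c = np.zeros(n_h)                           # hidden biases

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def energy(v, h):
    """E(v, h) = -b^T v - c^T h - v^T W h."""
    return -b @ v - c @ h - v @ W @ h

def sample_h_given_v(v):
    """p(h_i = 1 | v) = sigma(v^T W[:, i] + c_i); the h_i are conditionally independent."""
    return (rng.random(n_h) < sigmoid(v @ W + c)).astype(float)

def sample_v_given_h(h):
    """p(v_j = 1 | h) = sigma(W[j, :] h + b_j); the v_j are conditionally independent."""
    return (rng.random(n_v) < sigmoid(W @ h + b)).astype(float)

# Block Gibbs sampling: alternate between resampling all of h and all of v at once.
v = rng.integers(0, 2, n_v).astype(float)
for _ in range(1000):
    h = sample_h_given_v(v)
    v = sample_v_given_h(h)
print(v, h, energy(v, h))
```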
